Stopwords and Stylometry: A Latent Dirichlet Allocation Approach
Abstract
We illustrate the utility of generative models for stylometry – the science of author attribution. Although content words provide semantic handles and intuitively relate to author style, they typically involve a large vocabulary and are not consistent across corpora. Stopwords, in contrast, are limited in number, do not suffer from these issues, and yet seem to retain abstract signatures of author style. We explore the use of Latent Dirichlet Allocation on stopwords and show that the resulting topic distributions provide robust handles for classifying authors and performing authorship attribution. We also observe that they are effective in identifying the gender of the authors.
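The idea of running Latent Dirichlet Allocation on stopwords alone can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword list, the toy texts, the hyperparameters (`alpha`, `beta`, `n_topics`), and the use of collapsed Gibbs sampling for inference are all hypothetical choices made for the example. It uses only the Python standard library.

```python
import random
import re

# Hypothetical, tiny stopword list for illustration only;
# the abstract does not specify which list was used.
STOPWORDS = ["the", "of", "and", "a", "in", "to", "is", "it", "that", "but"]
vocab_index = {w: i for i, w in enumerate(STOPWORDS)}


def stopword_tokens(text, vocab_index):
    """Keep only stopwords, mapped to integer ids; content words are dropped."""
    words = re.findall(r"[a-z']+", text.lower())
    return [vocab_index[w] for w in words if w in vocab_index]


def lda_gibbs(docs, n_vocab, n_topics=2, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns one smoothed topic distribution per document.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]             # counts per (doc, topic)
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # counts per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return theta


# Toy texts (invented for the example).
texts = [
    "The cat sat in the hat and it is a cat that sat",
    "It is to be said that the end of a tale is the start of it",
    "But the dog ran to the park and the dog ran in the rain",
]
docs = [stopword_tokens(t, vocab_index) for t in texts]
theta = lda_gibbs(docs, n_vocab=len(STOPWORDS))

for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` is a document's distribution over stopword topics. In a setup like the one the abstract describes, such vectors would presumably serve as the features passed to an author (or gender) classifier; the classifier itself is left out of this sketch.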
Related resources
Term Weighting Schemes for Latent Dirichlet Allocation
Many implementations of Latent Dirichlet Allocation (LDA), including those described in Blei et al. (2003), rely at some point on the removal of stopwords, words which are assumed to contribute little to the meaning of the text. This step is considered necessary because otherwise high-frequency words tend to end up scattered across many of the latent topics without much rhyme or reason. We show...
Authorship Attribution with Latent Dirichlet Allocation
The problem of authorship attribution – attributing texts to their original authors – has been an active research area since the end of the 19th century, attracting increased interest in the last decade. Most of the work on authorship attribution focuses on scenarios with only a few candidate authors, but recently considered cases with tens to thousands of candidate authors were found to be muc...
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Learning Stylometric Representations for Authorship Analysis
Authorship analysis (AA) is the study of unveiling the hidden properties of authors from a body of exponentially exploding textual data. It extracts an author’s identity and sociolinguistic characteristics based on the reflected writing styles in the text. It is an essential process for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, mo...
Prior matters: simple and general methods for evaluating and improving topic quality in topic modeling
Latent Dirichlet Allocation (LDA) models trained without stopword removal often produce topics with high posterior probabilities on uninformative words, obscuring the underlying corpus content. Even when canonical stopwords are manually removed, uninformative words common in that corpus will still dominate the most probable words in a topic. In this work, we first show how the standard topic qu...